According to the American Stroke Association, stroke is the 5th leading cause of death in the United States. Strokes are debilitating because they cut off the supply of oxygen and nutrients the brain needs, leading to brain damage and, ultimately, death. Despite their severity, however, roughly 80% of strokes are preventable through proper health decisions.
The data I am utilizing can be found at https://www.kaggle.com/datasets/prosperchuks/health-dataset?select=stroke_data.csv. The data was collected in 2015 by the CDC's Behavioral Risk Factor Surveillance System. According to the CDC, it was gathered through a telephone survey administered to an American individual in each sampled household. The data records whether or not an individual had a stroke, along with other attributes including their age, gender, BMI, marital status, and whether or not they have other medical conditions such as hypertension. The data can be useful in determining the factor(s) that potentially increase the probability of an individual suffering a stroke. It could allow research scientists and medical professionals to know which combinations of factors contribute most to a stroke, so that they can provide medical advice and data-driven recommendations that help patients make life decisions that could prevent one. Furthermore, the CDC explicitly states that the dataset was created to glean insights that could identify impactful health behaviors, identify and address health issues, and ultimately propose legislative health actions that help each state better achieve its health objectives.
The prediction goal pertaining to my data is to determine whether or not an individual is more or less likely to suffer a stroke. Thus, given an individual's BMI, gender, age range, resting heart rate, whether they smoke, etc., a computer should be able to determine whether a stroke is more or less likely in one individual compared to others. Third parties that might be interested in my study include health organizations like the CDC and WHO, as they can update medical guidance regarding stroke prevention on their websites based on the conclusions drawn from analyzing this data set. Doctors and other medical professionals might also be interested, as the analysis can help them determine which of their patients are more at risk of a stroke and make data-driven decisions about how to reduce a patient's risk. Furthermore, doctors can glean insights regarding the key factor(s) that make strokes more likely and provide recommendations to patients. Perhaps a patient should be prescribed medication to quit smoking? Or encouraged to exercise more? Or perhaps a combination of recommendations is warranted.
https://www.stroke.org/en/about-stroke
https://www.cdc.gov/brfss/annual_data/2015/pdf/overview_2015.pdf
Determining Algorithm Success
My algorithm's measure of success will be determined by its ability to properly predict whether an individual's attributes, such as BMI, gender, age, etc., make them more likely than not to have a stroke. This prediction will be derived from patterns gleaned by training the algorithm on the given dataset, which includes past individuals, their attributes, and whether or not they suffered a stroke. A successful algorithm must clearly beat chance; note, however, that because only about 5% of the individuals in this dataset had a stroke, raw accuracy is a misleading yardstick: a model that always predicts "no stroke" would already be about 95% accurate while catching no at-risk patients at all. Furthermore, considering that this analysis is related to the medical field and patient wellbeing, the bar for success should be far higher than for other machine learning applications. If we are wrong, a doctor might not consider a patient at risk, and that patient might make lifestyle decisions that could induce a stroke. In other applications, such as song prediction in Spotify, the stakes are much lower: a wrongly recommended song is merely a small nuisance.
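To make the baseline concrete, here is a minimal sketch (synthetic labels, not the real dataset) showing how well a do-nothing majority-class predictor scores on data with roughly this dataset's class imbalance:

```python
import numpy as np

# Synthetic labels mimicking the dataset's imbalance: ~5% positive (stroke)
rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.05).astype(int)

# A "model" that always predicts the majority class (no stroke)
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(f"Majority-class baseline accuracy: {accuracy:.3f}")  # ~0.95, despite flagging nobody
```

This is why metrics like recall on the stroke class matter more here than raw accuracy.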
It is also important to consider the type of error my algorithm makes. The two potential types of error are false positives and false negatives: my algorithm can either state that an individual is stroke-prone when in reality they will be fine, or state that an individual is not stroke-prone when in reality they will not be fine. A false positive (Type I error) could lead an individual to take medication to reduce stroke risk, such as statins. These medications can have side effects such as headache and muscle pain, but are not life-threatening. An individual may also "waste time" on activities like frequent exercise and specific dieting to reduce stroke risk when it is not necessary; again, this is not life-threatening. However, if an individual were stroke-prone but did not know it (a false negative, or Type II error), they could make decisions that lead to permanent injury or death. In summary, it is essential that my algorithm is tuned so that the mistakes it makes are almost all false positives (Type I errors) rather than false negatives.
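One common way to push a classifier's mistakes toward false positives is to lower its decision threshold. The sketch below uses made-up predicted probabilities (purely illustrative, not real model output) to show the trade: a lower threshold eliminates a false negative at the cost of an extra false positive.

```python
import numpy as np

# Hypothetical predicted stroke probabilities and true outcomes (illustrative only)
probs  = np.array([0.10, 0.30, 0.45, 0.55, 0.70, 0.90])
y_true = np.array([0,    0,    1,    0,    1,    1   ])

def confusion_counts(threshold):
    y_pred = (probs >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())  # flagged at-risk, actually fine
    fn = int(((y_pred == 0) & (y_true == 1)).sum())  # missed an at-risk patient
    return fp, fn

print(confusion_counts(0.5))   # (1, 1) at the default threshold
print(confusion_counts(0.25))  # (2, 0): lower threshold, no misses, one more false alarm
```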
The above business understanding and algorithm-success statement describe what we would do with this data in the future: create a predictive algorithm that can determine a person's risk of having a stroke. However, for the purposes of this lab, we will make headway toward this goal by understanding, interpreting, and analyzing the stroke data set. Some of our more specific goals for Lab 1 include:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
# raw dataset csv, in kaggle there is a link to the github page that contains this
df = pd.read_csv("https://gist.githubusercontent.com/aishwarya8615/d2107f828d3f904839cbcb7eaa85bd04/raw/cec0340503d82d270821e03254993b6dede60afb/healthcare-dataset-stroke-data.csv")
each_individual_unique_bool = df["id"].is_unique
print("")
print("the premise that each individual ID in the DF is unique is:", each_individual_unique_bool)
print("")
# deleting the ID feature because each person's ID number has no correlation to their chances of having a stroke
del df["id"]
# display feature list, along with their data types
df.info()
the premise that each individual ID in the DF is unique is: True

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   gender             5110 non-null   object
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64
 3   heart_disease      5110 non-null   int64
 4   ever_married       5110 non-null   object
 5   work_type          5110 non-null   object
 6   Residence_type     5110 non-null   object
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object
 10  stroke             5110 non-null   int64
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB
When reading in the file, we first want to determine whether there are any duplicate rows (i.e., an individual repeated). Since each individual is identified by an ID, we check the ID column and find that it has no repeats.
df.head()
| gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
Dataset Overview
Considering all the attributes, we discussed whether or not each of them is important. At first, we considered deleting the ever_married attribute, as we did not see how it could relate to the presence of a stroke. However, after considering that marriage can significantly impact one's lifestyle and introduce children as stressors in family relationships, we realized that a person's marital status actually implies a lot about characteristics that could potentially be factors in inducing a stroke. Thus, it is worth keeping as an attribute to explore.
df.describe()
| age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | |
|---|---|---|---|---|---|---|
| count | 5110.000000 | 5110.000000 | 5110.000000 | 5110.000000 | 4909.000000 | 5110.000000 |
| mean | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 22.612647 | 0.296607 | 0.226063 | 45.283560 | 7.854067 | 0.215320 |
| min | 0.080000 | 0.000000 | 0.000000 | 55.120000 | 10.300000 | 0.000000 |
| 25% | 25.000000 | 0.000000 | 0.000000 | 77.245000 | 23.500000 | 0.000000 |
| 50% | 45.000000 | 0.000000 | 0.000000 | 91.885000 | 28.100000 | 0.000000 |
| 75% | 61.000000 | 0.000000 | 0.000000 | 114.090000 | 33.100000 | 0.000000 |
| max | 82.000000 | 1.000000 | 1.000000 | 271.740000 | 97.600000 | 1.000000 |
# This code is from Eric Larson's MachineLearningNotebooks repo
# We are utilizing it here to show missing data
# Dr. Larson's repo can be found at https://github.com/eclarson/MachineLearningNotebooks
!pip install missingno
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
# External package: conda install missingno
import missingno as mn
mn.matrix(df)
plt.title("Missing Data Visualization",fontsize=22)
plt.show()
df.isna().sum()
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
The missing-data visualization and the per-attribute NA counts show that the only missing values come from the BMI column; no other attributes have missing information. Given this, we must ascertain whether the individuals missing BMI have any particular traits in common (for instance, perhaps they are all elderly). If so, this would be significant, as it could help us decipher whether there is a reason behind these missing BMI values. Did the individual completing the survey simply not know their BMI? Or perhaps doctors feel that measuring BMI for elderly patients is unnecessary, since BMI is not always an accurate measure for the elderly? These are examples of legitimate reasons for leaving BMI out.
df_subset_where_bmi_is_null= df[df["bmi"].isnull()] #Creating a subset consisting of only individuals with no BMI
#finding the % of individuals in the subset who are female. Comparing with the % of female individuals in full set to see if BMI subset is skewed in some fashion.
count_subset_female=len(df_subset_where_bmi_is_null[df_subset_where_bmi_is_null['gender']=="Female"])
count_fullset_female=len(df[df['gender']=="Female"])
print('The % of females in the subset is', ((count_subset_female/len(df_subset_where_bmi_is_null))*100))
print('The % of females in the full set is', ((count_fullset_female/len(df))*100))
#doing the same comparison above, except regarding those who suffer from hypertension
count_subset_hypertension=len(df_subset_where_bmi_is_null[df_subset_where_bmi_is_null['hypertension']==1])
count_fullset_hypertension=len(df[df['hypertension']==1])
print('The % of individuals with hypertension in the subset is:', ((count_subset_hypertension/len(df_subset_where_bmi_is_null))*100))
print('The % of individuals with hypertension in the full set is:', ((count_fullset_hypertension/len(df))*100))
#Doing the same type of comparison above, except regarding % of those who suffer from heart disease.
count_subset_heart_disease=len(df_subset_where_bmi_is_null[df_subset_where_bmi_is_null['heart_disease']==1])
count_fullset_heart_disease=len(df[df['heart_disease']==1])
print('The % of individuals with heart disease in the subset is:', ((count_subset_heart_disease/len(df_subset_where_bmi_is_null))*100))
print('The % of individuals with heart disease in the full set is:', ((count_fullset_heart_disease/len(df))*100))
#Doing the same type of comparison above, except regarding proportion of individuals who got a stroke.
count_subset_stroke=len(df_subset_where_bmi_is_null[df_subset_where_bmi_is_null['stroke']==1])
count_fullset_stroke=len(df[df['stroke']==1])
print('The % of individuals with stroke in the subset is:', ((count_subset_stroke/len(df_subset_where_bmi_is_null))*100))
print('The % of individuals with stroke in the full set is:', ((count_fullset_stroke/len(df))*100))
The % of females in the subset is 48.258706467661696
The % of females in the full set is 58.590998043052835
The % of individuals with hypertension in the subset is: 23.383084577114428
The % of individuals with hypertension in the full set is: 9.74559686888454
The % of individuals with heart disease in the subset is: 16.417910447761194
The % of individuals with heart disease in the full set is: 5.401174168297456
The % of individuals with stroke in the subset is: 19.900497512437813
The % of individuals with stroke in the full set is: 4.87279843444227
At first glance, the subset of individuals without a BMI looks like a random selection from the whole data set. However, comparing the characteristics of individuals in the subset against those in the full set shows that may not be the case. If the 201 individuals without a BMI were just random errors from the data collection/processing phase, we would expect the ratio of males to females, the proportion with hypertension, and so on to be roughly the same in both sets. Instead, we see above that the subset has a lower proportion of females, a higher proportion of individuals with hypertension, a higher proportion with heart disease, and, ultimately, a higher proportion who had a stroke. This is significant: deleting a subset skewed toward fewer females, more pre-existing conditions, and more strokes could artificially shift the attribute proportions in a cleaned data set without NaN values.
We should now try to determine how the attribute characteristics of individuals in the data set will be impacted if we were to choose to delete rows with NAN values rather than impute them.
df_deleting_blank_bmi= df.dropna() #creating a new set without the n/a BMI's
count_delbmi_female=len(df_deleting_blank_bmi[df_deleting_blank_bmi['gender']=="Female"])
count_delbmi_hypertension=len(df_deleting_blank_bmi[df_deleting_blank_bmi['hypertension']==1])
count_delbmi_heart_disease=len(df_deleting_blank_bmi[df_deleting_blank_bmi['heart_disease']==1])
count_delbmi_stroke=len(df_deleting_blank_bmi[df_deleting_blank_bmi['stroke']==1])
#calculating the attribute proportions in the set with the missing-BMI rows dropped.
print('The proportion of females in the set WITHOUT bmi is:', ((count_delbmi_female/len(df_deleting_blank_bmi))*100),"out of 100")
print('The proportion of those with hypertension in the set WITHOUT bmi is:', ((count_delbmi_hypertension/len(df_deleting_blank_bmi))*100),"out of 100")
print('The proportion of those with heart disease in the set WITHOUT bmi is:', ((count_delbmi_heart_disease/len(df_deleting_blank_bmi))*100),"out of 100")
print('The proportion of those who had a stroke in the set WITHOUT bmi is:', ((count_delbmi_stroke/len(df_deleting_blank_bmi))*100),"out of 100")
print("")
print("")
print('The proportion of females in the ORIGINAL set is', ((count_fullset_female/len(df))*100),"out of 100")
print('The proportion of individuals with hypertension in the ORIGINAL set is:', ((count_fullset_hypertension/len(df))*100),"out of 100")
print('The proportion of individuals with heart disease in the ORIGINAL set is:', ((count_fullset_heart_disease/len(df))*100),"out of 100")
print('The proportion of individuals with stroke in the ORIGINAL set is:', ((count_fullset_stroke/len(df))*100),"out of 100")
print("")
print("")
#let's see the % change in the old versus new data.
def pct(count, frame):
    return count / len(frame) * 100

def pct_change(new, old):
    return (new - old) / old * 100

print("the % change in the proportion of individuals who are female from original to new set is:", pct_change(pct(count_delbmi_female, df_deleting_blank_bmi), pct(count_fullset_female, df)), "%")
print("the % change in the proportion of individuals with hypertension from original to new set is:", pct_change(pct(count_delbmi_hypertension, df_deleting_blank_bmi), pct(count_fullset_hypertension, df)), "%")
print("the % change in the proportion of individuals with heart disease from original to new set is:", pct_change(pct(count_delbmi_heart_disease, df_deleting_blank_bmi), pct(count_fullset_heart_disease, df)), "%")
print("the % change in the proportion of individuals who had a stroke from original to new set is:", pct_change(pct(count_delbmi_stroke, df_deleting_blank_bmi), pct(count_fullset_stroke, df)), "%")
The proportion of females in the set WITHOUT bmi is: 59.014055815848444 out of 100
The proportion of those with hypertension in the set WITHOUT bmi is: 9.187207170503157 out of 100
The proportion of those with heart disease in the set WITHOUT bmi is: 4.950091668364228 out of 100
The proportion of those who had a stroke in the set WITHOUT bmi is: 4.257486249745366 out of 100

The proportion of females in the ORIGINAL set is 58.590998043052835 out of 100
The proportion of individuals with hypertension in the ORIGINAL set is: 9.74559686888454 out of 100
The proportion of individuals with heart disease in the ORIGINAL set is: 5.401174168297456 out of 100
The proportion of individuals with stroke in the ORIGINAL set is: 4.87279843444227 out of 100

the % change in the proportion of individuals who are female from original to new set is: 0.722052511351223 %
the % change in the proportion of individuals with hypertension from original to new set is: -5.729661362909371 %
the % change in the proportion of individuals with heart disease from original to new set is: -8.351563676299977 %
the % change in the proportion of individuals who had a stroke from original to new set is: -12.62749101928185 %
Just looking at the proportions in the set without missing BMI values and comparing them to the original set, we could argue that the percentages look quite similar at first glance. The change in the proportion of females between the two sets is less than 1%, which we could argue is within an acceptable margin.
However, the relative change in the proportion of individuals with hypertension, heart disease, and stroke from the original set to the new set ranges from about 5% to more than 12%. Though the absolute change in each proportion is small, since we are dealing with approximately 5,000 individuals, the relative change suggests that even a small shift in the characteristics of a population sample could influence our analysis of the questions in the Data Visualization step.
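The repeated proportion comparisons above can be condensed into one helper. The sketch below runs on a small made-up frame (the column names match the stroke data, but the values and the `pct_change_after_dropna` helper are my own illustration):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the stroke data (column names match; values are made up)
toy = pd.DataFrame({
    "bmi":           [22.0, np.nan, 31.0, np.nan, 27.5, 24.0],
    "hypertension":  [0,    1,      0,    1,      0,    0],
    "heart_disease": [0,    1,      1,    0,      0,    0],
    "stroke":        [0,    1,      0,    1,      0,    0],
})

def pct_change_after_dropna(df, flags):
    """Relative % change in each flag's prevalence after dropping rows with missing BMI."""
    kept = df.dropna(subset=["bmi"])
    out = {}
    for col in flags:
        before = df[col].mean() * 100    # prevalence in the full frame
        after = kept[col].mean() * 100   # prevalence after dropping missing-BMI rows
        out[col] = (after - before) / before * 100
    return out

res = pct_change_after_dropna(toy, ["hypertension", "heart_disease", "stroke"])
print(res)
```

Because the missing-BMI rows in the toy frame carry most of the positive flags, the drop shifts the prevalences sharply, the same effect (in miniature) observed in the real data.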
Furthermore, considering potential third parties of our analysis include doctors, clinicians, and the medical space, there is little or no margin for error as the well-being of every single patient matters. Thus, we err on the side of caution and choose to impute our data.
The next step below is to explore and visualize the BMI values that we do have in the dataset to determine (1) whether we need to disregard or modify any existing BMI values due to possible typos and (2) what value to impute with.
df_just_bmi=df['bmi'].dropna() #create a subset without NULLs so they do not distort the descriptive statistics.
print("the median is:", df_just_bmi.median())
print("The mode is: ", df_just_bmi.mode())
print("")
print(df_just_bmi.describe()) #Seeing visualize and describe together allows me to comprehend distribution.
print("")
print("The % difference between the median and the mean is: ",(df_just_bmi.mean()-df_just_bmi.median())/df_just_bmi.median()*100)
the median is: 28.1
The mode is:  0    28.7
Name: bmi, dtype: float64

count    4909.000000
mean       28.893237
std         7.854067
min        10.300000
25%        23.500000
50%        28.100000
75%        33.100000
max        97.600000
Name: bmi, dtype: float64

The % difference between the median and the mean is:  2.822907159411644
Note that the N/A values are excluded from these calculations, so they do not distort our estimate of the "true middle" of the BMI values. Looking at the minimum, we see a BMI of 10.3, which is strangely low (almost impossible for an adult). I want to quickly look at this particular individual to determine whether this BMI is a typo (103, perhaps?) or makes sense given their other attributes.
Looking above, we can also see that the mode and the mean are nearly identical, with only about a 0.2 BMI difference. However, the mean is nearly 3% larger than the median. This difference leads us to wonder whether the distribution is skewed in some fashion, with a set of values on one side "pulling" the mean away from the true middle. Thus, I will look at a histogram and scatterplot of the values.
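The skew can also be checked numerically rather than visually: scipy's sample skewness is positive for right-skewed data, where the mean sits above the median. A quick sketch on synthetic lognormal values (not the actual BMI column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Lognormal values are right-skewed, roughly the shape the BMI histogram appears to have
sample = rng.lognormal(mean=3.3, sigma=0.25, size=5000)

sk = skew(sample)
print(f"mean   = {sample.mean():.2f}")
print(f"median = {np.median(sample):.2f}")
print(f"skew   = {sk:.2f}")  # positive => right-skewed, mean pulled above the median
```

The same `skew(df_just_bmi)` call on the real column would quantify what the histogram shows.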
df[df['bmi']==10.3]
| gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1609 | Female | 1.24 | 0 | 0 | No | children | Rural | 122.04 | 10.3 | Unknown | 0 |
Considering the individual was a baby, the BMI value does make sense and is not a typo.
#creating a histogram of BMI and count of individuals
df.hist(column='bmi', bins=120)
plt.xticks(np.arange(0, 100, 10)) #changing the x-tick marks to be spaced by 10 units for more clarity
plt.xlabel('BMI')
plt.ylabel('Count of Individuals')
plt.title('BMI of Individuals Surveyed by CDC')
plt.axvline(df_just_bmi.mean(), color='r', linestyle='dashed', linewidth=1) #vertical line for mean
plt.axvline(df_just_bmi.median(), color='k', linestyle='dashed', linewidth=1) #vertical line for median
plt.text(103,230,'red-line= mean',rotation=0,color='r')
plt.text(103,210,'black-line= median',rotation=0,color='k')
#creating an Index column so the x-axis of the scatter plot can identify each individual
if 'Index' not in df:
    df.insert(0, 'Index', range(1, 1+len(df)))
df[['Index','bmi']].plot(kind='scatter', x='Index', y='bmi', figsize=(20,10))
plt.axhline(y=df_just_bmi.mean(), color='r', linestyle='-')
plt.axhline(y=df_just_bmi.median(), color='g', linestyle='-')
gcd_ratio= math.gcd(df_just_bmi[df_just_bmi>df_just_bmi.mean()].count(), df_just_bmi[df_just_bmi<df_just_bmi.mean()].count())
print("The ratio of individuals above and below the mean value is--", df_just_bmi[df_just_bmi>df_just_bmi.mean()].count()/gcd_ratio, ":",df_just_bmi[df_just_bmi<df_just_bmi.mean()].count()/gcd_ratio)
print("")
print("The ratio of individuals above and below the median value is--", df_just_bmi[df_just_bmi>df_just_bmi.median()].count()/gcd_ratio, ":",df_just_bmi[df_just_bmi<df_just_bmi.median()].count()/gcd_ratio)
The ratio of individuals above and below the mean value is-- 2212.0 : 2697.0

The ratio of individuals above and below the median value is-- 2426.0 : 2454.0
The histogram allows us to see the relative distribution of individuals based on their BMI. We can see that the distribution is skewed to the right. When we add clear lines for the mean and median, we can see that the mean is greater than the median.
The scatterplot shows the same picture as the histogram but lets us visualize the data on a per-individual basis using each individual's index value. We can see more clearly that, though the mean and median do not differ greatly, they differ by enough that a number of individuals fall between the two lines. Thus, we compute the ratio of individuals above and below each line to see whether the mean or the median offers a split closer to 1:1, with half above and half below. The ratio calculations show that the median is a much better midpoint than the mean in this instance. Thus, we choose to impute using the median.
df.fillna(df.median(numeric_only=True), inplace=True)
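As a sanity check on this imputation step, here is the same fillna pattern on a tiny made-up frame, confirming that the NaN is replaced by the column median and nothing is left missing:

```python
import pandas as pd
import numpy as np

# Toy frame with a missing bmi value, standing in for the real df
toy = pd.DataFrame({"bmi": [22.0, np.nan, 31.0], "age": [30.0, 40.0, 50.0]})
toy.fillna(toy.median(numeric_only=True), inplace=True)

# median of the observed bmi values [22.0, 31.0] is 26.5, so the NaN becomes 26.5
print(toy["bmi"].tolist())            # [22.0, 26.5, 31.0]
assert toy.isna().sum().sum() == 0    # nothing left missing
```

`numeric_only=True` matters on the real df: it tells `median()` to skip the string-valued columns (gender, work_type, etc.) so `fillna` only touches the numeric ones.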
Visualizing Basic Feature Distributions
df
| Index | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 2 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | 28.1 | never smoked | 1 |
| 2 | 3 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 4 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 5 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 5106 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | 28.1 | never smoked | 0 |
| 5106 | 5107 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
| 5107 | 5108 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | 5109 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | 5110 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |
5110 rows × 12 columns
df_for_correlation_heatmap= df.drop(columns=['Index', 'hypertension', 'heart_disease', 'stroke'])
correlation_plot= df_for_correlation_heatmap.corr()
the_plot_in_form_of_heat= sns.heatmap(correlation_plot, annot=True, vmin=-1, vmax=1)
the_plot_in_form_of_heat.set_title('Correlation of Different Combinations of Variables', pad=20, weight='bold')
Text(0.5, 1.0, 'Correlation of Different Combinations of Variables')
In visualizing basic features, my first step is to determine which attributes show a linear relationship with one another. Looking at the visualization above, we can ignore the 1 values along the diagonal, as comparing any attribute with itself trivially yields a perfect correlation and gleans no insight. We drop the Index column before generating the matrix, since the index values are merely a means of identifying/selecting each individual. We also drop the binary columns (hypertension, heart_disease, stroke), since a Pearson correlation between a 0/1 indicator and a numeric variable is not meaningful here, and pandas already excludes the remaining string-valued categorical columns from .corr().
We see that the strongest linear correlation shown is between age and BMI and has a correlation of 0.32. This suggests there is a weak positive correlation between age and BMI. Furthermore, age appears to be a greater predictor of BMI than average glucose level.
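Rather than reading the strongest pair off the heatmap by eye, it can be extracted programmatically. Below is a sketch on a toy numeric frame (synthetic columns loosely mimicking age, bmi, and avg_glucose_level); the same three correlation lines would apply to df_for_correlation_heatmap:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
age = rng.uniform(1, 80, 500)
toy = pd.DataFrame({
    "age": age,
    "bmi": 20 + 0.1 * age + rng.normal(0, 3, 500),  # weakly tied to age
    "avg_glucose_level": rng.normal(106, 45, 500),  # independent noise
})

corr = toy.corr().abs()
np.fill_diagonal(corr.values, 0)   # ignore the trivial self-correlations
pair = corr.stack().idxmax()       # (row, column) labels of the strongest pair
print(pair, round(corr.loc[pair], 2))
```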
To determine whether a relationship exists between a categorical variable and a quantitative variable, I can perform an ANOVA test, with the categorical variable on the x-axis and the quantitative variable on the y-axis. We can then determine whether the categorical variable has a statistically significant impact on the quantitative variable by calculating the p-value. A statistical explanation of ANOVA, and an example of it implemented in Python to be referenced when we do the box plots and statistical hypothesis testing on our data, is provided below:
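For reference, a one-way ANOVA in Python follows this shape: split the quantitative variable into one sample per category and hand those samples to scipy's f_oneway. The two groups below are synthetic stand-ins (means and spreads loosely inspired by the glucose summary statistics), not the actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic glucose-like values for the two levels of a binary variable
group_without = rng.normal(loc=104, scale=43, size=400)  # e.g. hypertension == 0
group_with    = rng.normal(loc=130, scale=55, size=200)  # e.g. hypertension == 1

fvalue, pvalue = stats.f_oneway(group_without, group_with)
print(f"F = {fvalue:.2f}, p = {pvalue:.4g}")
# With only two groups, one-way ANOVA is equivalent to a two-sample t-test (F = t**2)
```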
Thus, the next step in better understanding the data is to determine whether there is a significant difference in glucose levels depending on whether an individual has hypertension, has heart disease, or has had a stroke. We could also perform similar analyses based on an individual's work type and residence type.
boxplot_hypertension_glucose= sns.boxplot(x='hypertension', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
boxplot_heart_disease_glucose= sns.boxplot(x='heart_disease', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
boxplot_stroke_glucose= sns.boxplot(x='stroke', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
Above, we notice that the boxplots generated for hypertension, heart disease, and stroke look virtually identical. This suggests that if stroke, heart disease, or hypertension has a statistically significant association with glucose levels, the association is similar across the three. The boxplots suggest that the presence of a prior stroke, heart disease, or hypertension is related to significantly larger variability in glucose levels. Furthermore, the median glucose level for a person with a stroke, heart disease, or hypertension is greater than for one without, and the third-quartile glucose levels for those groups are significantly greater as well. The next step is to see whether the differences in the boxplots above are statistically significant:
#Create a new dataframe that solely has the information needed for each hypothesis test.
import scipy.stats as stats
from decimal import Decimal # Use the decimal library to print with increased precision to compare to confidence level
df_hypertension_glucose_hyp_test = df[['hypertension', 'avg_glucose_level']]
# f_oneway expects one sample of glucose values per group,
# so split avg_glucose_level by hypertension status (0 vs 1).
fvalue, pvalue = stats.f_oneway(
    df_hypertension_glucose_hyp_test.loc[df_hypertension_glucose_hyp_test['hypertension'] == 0, 'avg_glucose_level'],
    df_hypertension_glucose_hyp_test.loc[df_hypertension_glucose_hyp_test['hypertension'] == 1, 'avg_glucose_level'],
)
print("the p-value for hypertension and glucose levels is",'{:.5e}'.format(Decimal(pvalue)))
the p-value for hypertension and glucose levels is 0.00000e+5
df_heart_disease_hyp_test = df[['heart_disease', 'avg_glucose_level']]
# Split avg_glucose_level by heart-disease status so f_oneway gets one sample per group.
fvalue, pvalue = stats.f_oneway(
    df_heart_disease_hyp_test.loc[df_heart_disease_hyp_test['heart_disease'] == 0, 'avg_glucose_level'],
    df_heart_disease_hyp_test.loc[df_heart_disease_hyp_test['heart_disease'] == 1, 'avg_glucose_level'],
)
print("the p-value for heart disease and glucose levels is",'{:.5e}'.format(Decimal(pvalue)))
the p-value for heart disease and glucose levels is 0.00000e+5
df_stroke_test = df[['stroke', 'avg_glucose_level']]
# Split avg_glucose_level by stroke status so f_oneway gets one sample per group.
fvalue, pvalue = stats.f_oneway(
    df_stroke_test.loc[df_stroke_test['stroke'] == 0, 'avg_glucose_level'],
    df_stroke_test.loc[df_stroke_test['stroke'] == 1, 'avg_glucose_level'],
)
print("the p-value for stroke and glucose levels is",'{:.5e}'.format(Decimal(pvalue)))
the p-value for stroke and glucose levels is 0.00000e+5
For each of the tests above:
1.
H0: There is no significant difference in glucose levels regardless of the presence of hypertension.
HA: There is a significant difference in glucose levels based on the presence of hypertension.
Because 0.00 < 0.05, we reject the null hypothesis: there is a statistically significant difference based on hypertension.
2.
H0: There is no significant difference in glucose levels regardless of the presence of heart disease.
HA: There is a significant difference in glucose levels based on the presence of heart disease.
Because 0.00 < 0.05, we reject the null hypothesis: there is a statistically significant difference based on heart disease.
3.
H0: There is no significant difference in glucose levels regardless of whether an individual has had a stroke.
HA: There is a significant difference in glucose levels based on whether an individual has had a stroke.
Because 0.00 < 0.05, we reject the null hypothesis: there is a statistically significant difference based on stroke.
Note: To print the p-value with increased precision above, I referred to an example use of the Decimal library on Stack Overflow.
boxplot_work_type_glucose= sns.boxplot(x='work_type', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
boxplot_residence_type_glucose= sns.boxplot(x='Residence_type', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
violin_work_type_glucose= sns.violinplot(x='work_type', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
violin_residence_type_glucose= sns.violinplot(x='Residence_type', y='avg_glucose_level', data=df, color='#ba5425')
plt.show()
For average glucose levels based on residence type and work type, I decided to use both a box plot and a violin plot due to the large number of outliers. This way, I can clearly see the median and quartiles provided by the boxplot while also seeing a more detailed distribution of the data, including outliers, in the violin plots.
Regarding residence types, we see that the distribution of glucose levels appears virtually identical regardless of residence. This surprised me: prior research has shown that individuals in rural areas actually do less physical activity than those in urban areas, which leads to higher rates of heart disease and obesity in rural parts of the USA, so higher glucose levels would be expected for rural residents.
For the box plot and violin plot for work types, we see that although self-employed people may have higher variance in their glucose-level distribution, there appears to be no meaningful difference in median glucose levels.
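To put numbers behind this visual reading, per-category medians and quartiles can be computed with a quick groupby summary. The sketch below uses a small synthetic frame with made-up values; only the column names `work_type` and `avg_glucose_level` are borrowed from our dataset:

```python
import pandas as pd

# Synthetic stand-in with the same column names as our dataset (values are made up).
toy = pd.DataFrame({
    'work_type': ['Private', 'Private', 'Self-employed', 'Self-employed', 'Govt_job', 'Govt_job'],
    'avg_glucose_level': [85.0, 110.0, 90.0, 160.0, 95.0, 100.0],
})

# Quartiles and median of glucose per work type; a wider spread for one group
# mirrors the higher variance a violin plot would show for that group.
summary = toy.groupby('work_type')['avg_glucose_level'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Comparing the 0.5 column across rows gives the same median comparison the boxplots show, but as exact numbers.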
Question 1: Is there a significant difference in glucose levels related to the past occurrence of a stroke, depending on the age of individuals?
For the question above, we want to know whether having had a stroke corresponds to a significant difference in an individual's glucose level, while also accounting for age. To answer this, we must compare mean glucose levels between the stroke groups while controlling for the age covariate.
The first step in this instance will be to visualize the impact of stroke, and different age ranges on glucose levels through plots.
Then, we must perform an ANCOVA test to confirm what we are seeing. In this instance, stroke is a factor variable, age is the covariate, and glucose is the response variable. To perform this test, I will refer to the website below that provides an example with Python code on how this test works:
from pingouin import ancova
import plotly.express as express # NOTE: This plotly.express boxplot is preferable, as hovering the mouse over the boxplot pops up the quartiles, median, and other relevant info.
df_for_Q1= df[['stroke','age', 'avg_glucose_level']]
#break up age column to multiple columns based on different age ranges.
agelesseq10=df[df['age']<=10]
agelesseq10=agelesseq10[['stroke','avg_glucose_level','age']]
agelesseq30=df[(df['age']>10) & (df['age']<=30)]
agelesseq30=agelesseq30[['stroke','avg_glucose_level','age']]
agelesseq50=df[(df['age']>30) & (df['age']<=50)]
agelesseq50=agelesseq50[['stroke','avg_glucose_level','age']]
agelesseq70=df[(df['age']>50) & (df['age']<=70)]
agelesseq70=agelesseq70[['stroke','avg_glucose_level','age']]
agelesseq110=df[(df['age']>70) & (df['age']<=110)]
agelesseq110=agelesseq110[['stroke','avg_glucose_level','age']]
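As an aside, the repeated mask-and-slice pattern above can be expressed more compactly with `pd.cut`, which assigns every row to an age bin in a single pass. The sketch below uses a synthetic frame with made-up values; only the column names match our dataset:

```python
import pandas as pd

# Synthetic stand-in with the column names used above (values are made up).
toy = pd.DataFrame({
    'age': [5, 22, 41, 63, 80],
    'avg_glucose_level': [88.0, 95.0, 101.0, 130.0, 145.0],
    'stroke': [0, 0, 0, 1, 1],
})

# Same cut points as the manual masks: (0,10], (10,30], (30,50], (50,70], (70,110].
bins = [0, 10, 30, 50, 70, 110]
labels = ['<=10', '11-30', '31-50', '51-70', '71-110']
toy['age_range'] = pd.cut(toy['age'], bins=bins, labels=labels)

# Each age range can now be selected or summarized without repeating boolean masks.
print(toy.groupby('age_range', observed=True)['avg_glucose_level'].median())
```

This keeps the bin boundaries in one place, so changing an age cutoff later only touches one list.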
express.box(agelesseq10, y='avg_glucose_level', x='stroke', color='stroke', width=500, height=400)
express.box(agelesseq30, y='avg_glucose_level', color='stroke', width=500, height=400)
fiftyfig1= express.box(agelesseq50, y='avg_glucose_level', color='stroke', width=500, height=400);
fiftyfig2=express.violin(agelesseq50, y='avg_glucose_level', color='stroke', width=500, height=400)
fiftyfig1.show()
fiftyfig2.show()
For the boxplots for individuals aged ten or younger and for the 10 < age <= 30 range, we do not see any individual who had a stroke at those ages. The median glucose level for both boxplots is also similar, at around 90. Furthermore, quartiles one and three are similarly placed for both age ranges (about 75 and 105, respectively).
When looking at the 30-50 age range, we see a very similar distribution for those with and without a stroke. The violin plot also shows that the distributions are similar, with a tall upper peak made up of outlier values.
express.box(agelesseq70, y='avg_glucose_level', color='stroke', width=500, height=400)
express.box(agelesseq110, y='avg_glucose_level', color='stroke', width=500, height=400)
For the 50-70 age range, we see a major shift in the variance of the glucose levels of those who suffered a stroke, with a much higher third quartile and maximum. The difference in medians between those who suffered a stroke and those who didn't also increases, to around 20 units. And when looking at individuals 70 and up, we see that the variation in glucose levels for those who didn't have a stroke also increases significantly, with a greater proportion of individuals having abnormally high glucose levels.
ancova(data=df_for_Q1, dv='avg_glucose_level', covar='age', between='stroke')
| | Source | SS | DF | F | p-unc | np2 |
|---|---|---|---|---|---|---|
| 0 | stroke | 6.027184e+04 | 1 | 31.338777 | 2.280240e-08 | 0.006099 |
| 1 | age | 4.721654e+05 | 1 | 245.505817 | 4.427654e-54 | 0.045867 |
| 2 | Residual | 9.821962e+06 | 5107 | NaN | NaN | NaN |
The p-values for stroke and age, both approximately zero, fall below the 0.05 significance level. We can thus reject the null hypothesis that having had a stroke does not influence avg_glucose_level, even when controlling for the age attribute.
The np2 (partial eta squared) value is also interesting, as it suggests that the effect of age on glucose levels is modest, at about 0.046, while stroke has an even smaller effect, at about 0.006. Thus, age would appear to be a greater determinant of glucose levels than having had a stroke. This is significant because a doctor might want to prioritize an older individual, even one whose diet and lifestyle are healthier, as being at greater risk than a younger person with a less healthy diet and less of the exercise known to reduce glucose levels.
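As a sanity check, partial eta squared can be recomputed directly from the SS column of the ANCOVA output, since np2 = SS_effect / (SS_effect + SS_residual):

```python
# Sum-of-squares values taken from the ANCOVA table above.
ss_stroke = 6.027184e+04
ss_age = 4.721654e+05
ss_residual = 9.821962e+06

def partial_eta_squared(ss_effect: float, ss_resid: float) -> float:
    """np2 = SS_effect / (SS_effect + SS_residual)."""
    return ss_effect / (ss_effect + ss_resid)

print(f"np2 (stroke) = {partial_eta_squared(ss_stroke, ss_residual):.6f}")  # ~0.006099
print(f"np2 (age)    = {partial_eta_squared(ss_age, ss_residual):.6f}")    # ~0.045867
```

Both values match the np2 column pingouin reports, confirming how the effect sizes are derived.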
Question 2: How does a person's smoking status influence the likelihood of a stroke occurring?
For this question, the first step is to decide what to do about individuals who have an "Unknown" smoking status, which accounts for approximately 30% of all individuals in the data set. In this instance, it is better to drop than to impute, as there is no way to guess whether an individual has smoked or not. We are aware of the risk of skewing the data: for instance, the "Unknown" category might contain a disproportionate number of people who smoked but felt uncomfortable disclosing that information for privacy reasons.
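Dropping those rows is straightforward; the sketch below uses a synthetic frame (only the `smoking_status` column name and the `'Unknown'` label are assumed to match our dataset) and also reports how much data the drop discards:

```python
import pandas as pd

# Synthetic stand-in with the same column name as our dataset (values are made up).
toy = pd.DataFrame({
    'smoking_status': ['smokes', 'Unknown', 'never smoked', 'formerly smoked', 'Unknown'],
    'stroke': [1, 0, 0, 1, 0],
})

# Report how much data the drop discards before committing to it.
share_unknown = (toy['smoking_status'] == 'Unknown').mean()
print(f"Share of rows with unknown smoking status: {share_unknown:.0%}")

# Keep only rows where the smoking status is actually known.
toy_known = toy[toy['smoking_status'] != 'Unknown']
print(toy_known['smoking_status'].unique())
```

Checking the discarded share first makes the skew risk explicit before the rows are removed.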
#create df based on each smoking status type besides the "unknown" category
df_q2_formerly_smoked = df[df['smoking_status'] == "formerly smoked"]
df_q2_never_smoked= df[df['smoking_status']=="never smoked"]
df_q2_smokes=df[df['smoking_status'] == "smokes"]
percentage_of_strokes1= (df_q2_formerly_smoked['stroke'].mean()) * 100
percentage_of_no_strokes1=100-percentage_of_strokes1
percentage_of_strokes2= (df_q2_never_smoked['stroke'].mean()) * 100
percentage_of_no_strokes2=100-percentage_of_strokes2
percentage_of_strokes3= (df_q2_smokes['stroke'].mean()) * 100
percentage_of_no_strokes3=100-percentage_of_strokes3
print("the percentage of individuals who formerly smoked with stroke is",round(percentage_of_strokes1,3))
print("the percentage of individuals who formerly smoked without stroke is",round(percentage_of_no_strokes1,3))
print("the percentage of individuals who never smoked with stroke is",round(percentage_of_strokes2,3))
print("the percentage of individuals who never smoked without stroke is",round(percentage_of_no_strokes2,3))
print("the percentage of individuals who smokes with stroke is",round(percentage_of_strokes3,3))
print("the percentage of individuals who smokes without stroke is",round(percentage_of_no_strokes3,3))
# Passing figsize directly to .plot avoids creating a separate, empty figure.
stacked_bar = pd.crosstab(df_q2_formerly_smoked["smoking_status"], df_q2_formerly_smoked.stroke.astype(bool), normalize='index')
stacked_bar.plot(kind='barh', stacked=True, figsize=(15,5), title='stroke ratio for those who formerly smoked')
plt.show()
stacked_bar = pd.crosstab(df_q2_never_smoked["smoking_status"], df_q2_never_smoked.stroke.astype(bool), normalize='index')
stacked_bar.plot(kind='barh', stacked=True, figsize=(15,5), title='stroke ratio for those who never smoked')
plt.show()
stacked_bar = pd.crosstab(df_q2_smokes["smoking_status"], df_q2_smokes.stroke.astype(bool), normalize='index')
stacked_bar.plot(kind='barh', stacked=True, figsize=(15,5), title='stroke ratio for those who smoke')
plt.show()
the percentage of individuals who formerly smoked with stroke is 7.91
the percentage of individuals who formerly smoked without stroke is 92.09
the percentage of individuals who never smoked with stroke is 4.757
the percentage of individuals who never smoked without stroke is 95.243
the percentage of individuals who smokes with stroke is 5.323
the percentage of individuals who smokes without stroke is 94.677
The results above suggest that strokes occur more frequently in individuals who previously smoked, as demonstrated by the roughly 8% of individuals in that category who suffered a stroke. Though individuals who have never smoked have the smallest ratio of strokes, the difference from the ratio for those who currently smoke is less than 1%. This is a significant insight, as it suggests that even if one stops smoking, permanent or long-term internal effects of smoking may still increase the risk of stroke throughout one's lifespan. Thus, instead of merely persuading patients who already smoke to quit, doctors should also find patients who may have only just begun to smoke and convince them to stop immediately. Furthermore, increased emphasis should be placed on ensuring that an individual who has never smoked is never pushed or convinced to start. Health organizations at the local and national level may therefore want to expand education initiatives.
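To verify that these percentage differences are statistically significant rather than sampling noise, a natural follow-up would be a chi-squared test of independence on the smoking-status-by-stroke contingency table. The counts below are hypothetical, used only to illustrate the mechanics, and are not taken from our dataset:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are smoking statuses
# (formerly smoked, never smoked, smokes); columns are (no stroke, stroke).
observed = np.array([
    [815, 70],
    [1802, 90],
    [712, 40],
])

# chi2_contingency compares observed counts to those expected under independence.
chi2, pvalue, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {pvalue:.4f}")
# If p < 0.05, we would conclude smoking status and stroke are not independent.
```

On the real data, the same call can be fed `pd.crosstab(df['smoking_status'], df['stroke'])` after the "Unknown" rows are dropped.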
Question 3: How do the ages compare between people who have had strokes and people who haven't? From the histograms below, we can observe that among patients who have had a stroke, the majority are over 40 years old. We can also observe that the average patient who has not had a stroke is slightly younger, and that patients who have had a stroke are significantly older on average. There appears to be a positive relationship between age and the frequency of stroke: as patients get older, strokes become more common.
import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(15,5))
df_nostroke = df[df["stroke"] == 0]
df_stroke = df[df["stroke"] == 1]
plt.subplot(1,3,1)
df.age.plot.hist(bins=30, title="Age of all patients")
plt.axvline(x=df.age.mean(), color='r', linestyle='-')
plt.subplot(1,3,2)
df_nostroke.age.plot.hist(bins=30, title="Age of patients who have NOT had stroke")
plt.axvline(x=df_nostroke.age.mean(), color='r', linestyle='-')
plt.subplot(1,3,3)
df_stroke.age.plot.hist(bins=30, title="Age of patients who have had stroke")
plt.axvline(x=df_stroke.age.mean(), color='r', linestyle='-')
plt.show()
Do heart disease and/or hypertension increase the likelihood of stroke? From the figure below we can observe that:
# We can use a cross tabulation with heart disease and hypertension;
# setting normalize to 'index' shows percentages rather than counts.
# Passing figsize directly to .plot avoids creating a separate, empty figure.
hd_ht = pd.crosstab([df["heart_disease"], df["hypertension"]], df.stroke.astype(bool), normalize='index')
hd_ht.plot(kind='barh', stacked=True, figsize=(15,5))
plt.show()